Download and read training and test data into R and prepare graphical and numerical summaries of it: e.g. histograms of continuous attributes, contingency tables of categorical variables, scatterplots of continuous attributes with some of the categorical variables indicated by color/symbol shape, etc. Whatever you find helpful to think about properties of the data you are about to start using for fitting classification models.
Taking a quick look at the data, we see from a table of the outcome “noyes” that there are approximately three times as many no’s as yes’s. Among the predictors there are 6 continuous, 1 dummy, and 9 categorical variables.
Histograms of the 6 continuous attributes are shown below, and their distributions look fairly similar across the training and test datasets. Most of the variables appear right skewed, except for “vuu”, which appears left skewed.
To get a general sense of the data, some pairwise contingency tables of the outcome with the predictors were examined. Some interesting things we can see from the tables are that “cle” having a value of fxb and “kbc” having values of fw or hh are highly predictive of “noyes” being yes. There is a preference for “noyes” being no when “gb” is 0, “xxp” is ndk, and “rao” is e2. On the other hand “at” seems to be evenly distributed in a 3:1 ratio for no vs. yes for each of its categories.
Scatterplots of the continuous attributes are shown below (for a subset of rows so that the points are easier to discern), colored by some of the categorical variables as well as the outcome. It’s difficult to see any obvious patterns in the bulk of the data, but some patterns can be observed for the more extreme points. For example, “vuu” being 0 corresponds strongly with “og” being kc and “noyes” being no, while high values of “ihj” and “ptj” correspond with “noyes” being yes.
set.seed(121)
# Load data
trainData <- read.csv("final-data-train.csv")
testData <- read.csv("final-data-test.csv")
# Examine data (commented out to save space)
# str(trainData) # 16 predictors: 6 continuous, 1 dummy, 9 categorical
# str(testData)
table(trainData$noyes)
##
## no yes
## 23802 7753
# Change dummy variable gb to factor
trainData$gb = factor(trainData$gb)
testData$gb = factor(testData$gb)
# Combine the predictors of the two datasets for exploratory analysis
combData = rbind(cbind(trainData[,-1], trainTest="Train"), cbind(testData, trainTest="Test"))
# Continuous attributes
contAtt = c("vuu", "mt", "zq", "zf", "ihj", "ptj")
# Histograms of continuous attributes
plots <- lapply(contAtt, function(i) { ggplot(combData, aes(x=.data[[i]], fill=trainTest)) +
xlab(i) +
ggtitle(paste0("Histogram of ",i)) +
geom_histogram() })
do.call("grid.arrange", c(plots, ncol=3))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Dummy and categorical attributes
catAtt = c("bdb", "cle", "kbc", "og", "gb", "xxp", "rao", "at", "lf", "fq","noyes")
# Contingency table of noyes with some categorical attributes
for ( i in 1:9 ) {
print(catAtt[i])
print(table(trainData[, "noyes"], trainData[, catAtt[i]]))
}
## [1] "bdb"
##
## be cu dt eb mf rh rz xy
## no 609 5995 2410 1751 1266 2769 537 8465
## yes 301 1623 895 807 545 1170 287 2125
## [1] "cle"
##
## afs cqy fpj fxb rmv
## no 1188 4570 17836 2 206
## yes 373 1660 5469 23 228
## [1] "kbc"
##
## fv fw hh kd ld ln mo sv uz wg wh wr zl
## no 279 8 0 780 727 13 49 3446 2364 101 9183 148 6704
## yes 85 25 20 237 410 53 105 503 573 344 1422 558 3418
## [1] "og"
##
## je kc
## no 4816 18986
## yes 1518 6235
## [1] "gb"
##
## 0 1
## no 20281 3521
## yes 4475 3278
## [1] "xxp"
##
## ndk ptn
## no 22428 1374
## yes 5919 1834
## [1] "rao"
##
## e1 e2 e3 e4 e5 e6 e7 e8
## no 41 23219 149 37 49 131 16 160
## yes 248 5805 180 174 169 520 34 623
## [1] "at"
##
## ghz pxe urz zgp
## no 5458 1490 10273 6581
## yes 1649 491 3369 2244
## [1] "lf"
##
## df ha ii rw wp xp
## no 14 18 269 8667 693 14141
## yes 25 77 842 2096 161 4552
# Contingency table of some categorical attributes with each other
for ( i in seq(1,7,2) ) {
print(paste(catAtt[i], "and", catAtt[i+1]))
print(table(trainData[, catAtt[i]], trainData[, catAtt[i+1]]))
}
## [1] "bdb and cle"
##
## afs cqy fpj fxb rmv
## no 1188 4570 17836 2 206
## yes 373 1660 5469 23 228
## [1] "kbc and og"
##
## je kc
## be 198 712
## cu 1517 6101
## dt 631 2674
## eb 524 2034
## mf 366 1445
## rh 763 3176
## rz 117 707
## xy 2218 8372
## [1] "gb and xxp"
##
## ndk ptn
## fv 343 21
## fw 29 4
## hh 10 10
## kd 890 127
## ld 1062 75
## ln 60 6
## mo 126 28
## sv 3727 222
## uz 2707 230
## wg 381 64
## wh 9662 943
## wr 498 208
## zl 8852 1270
## [1] "rao and at"
##
## ghz pxe urz zgp
## 0 182 42 368 251
## 1 86 23 196 104
## 1.41 94 26 172 110
## 1.73 152 27 279 189
## 2 177 40 312 218
## 2.24 253 67 517 335
## 2.45 261 64 512 355
## 2.65 397 109 719 523
## 2.83 564 158 1172 738
## 3 589 160 1141 740
## 3.16 752 214 1465 906
## 3.32 797 223 1528 984
## 3.46 887 270 1674 1057
## 3.61 1283 352 2408 1511
## 3.74 633 206 1179 804
# Pairwise plots of continuous attributes with various categorical attributes colored for training data
nTrain = nrow(trainData)
rowsTrain = seq(1, nTrain, 50)
for ( i in c(1,4,7,11) ) {
pairs(sapply(trainData[rowsTrain, contAtt], jitter), col=trainData[rowsTrain, catAtt[i]], pch=16, oma=c(3,3,5,10), cex=0.4, main=paste("Pairs plot colored by", catAtt[i], "for training data"))
oldPar <- par(xpd = TRUE)
legend("topright", pch=16, col = 1:length(levels(trainData[rowsTrain, catAtt[i]])), legend = c(levels(trainData[rowsTrain, catAtt[i]])))
par(oldPar)
}
# Pairwise plots of continuous attributes with various categorical attributes colored for test data
nTest = nrow(testData)
rowsTest = seq(1, nTest, 50)
for ( i in c(1,4,7) ) {
pairs(sapply(testData[rowsTest, contAtt], jitter), col=testData[rowsTest, catAtt[i]], pch=16, oma=c(3,3,5,10), cex=0.4, main = paste("Pairs plot colored by", catAtt[i], "for testing data"))
oldPar <- par(xpd = TRUE)
legend("topright", pch=16, col = 1:length(levels(testData[rowsTest, catAtt[i]])), legend = c(levels(testData[rowsTest, catAtt[i]])))
par(oldPar)
}
As is often the case for such contests, the attributes in the dataset are blinded in the sense that no information is available about what they are or what their values mean. The only information available is that the attribute noyes is the outcome to be modeled and the attribute id is a unique numerical identifier for each observation. Some of the remaining attributes are clearly categorical (those that are character valued) and some rather obviously continuous (those with numerical values and a large number of unique values). For several of them it is less clear whether it is best to treat them as continuous or categorical – e.g. their values are numerical but there are relatively few unique values with many observations taking the same value, so that they arguably could be treated as either. Please identify them, reflect on how you prefer to handle them, and describe this in your own words.
“gb” is a numerically coded attribute that should clearly be treated as categorical, since it only takes the integer values 0 and 1. As seen from the code below, even though some of the other numeric attributes clearly have discrete values, the ones with the fewest are “vuu” with 15 and “mt” with 19. Looking closer at these variables in the histograms and pairwise scatterplots above, I would still treat them as continuous, since there is enough spread and information to be gained from their numeric values and I see no clear cutoffs for dividing them into categories. If, for example, there were a clear difference between “vuu” having a value of 0 versus values of 1 or more, it might make sense to convert it into a categorical variable with a cutoff between 0 and 1. However, the trends seem more subtle, and at this time I think it is better to keep the ordering information from the actual numeric values.
# Examine how many unique values for each continuous variable
for (i in 1:ncol(trainData)) {
if (class(trainData[,i]) == "numeric") {
print(paste0(names(trainData)[i], ": ", length(unique(trainData[,i]))))
}
}
## [1] "vuu: 15"
## [1] "mt: 19"
## [1] "zq: 31550"
## [1] "zf: 659"
## [1] "ihj: 111"
## [1] "ptj: 31555"
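As a concrete illustration of the cutoff idea discussed above, a sketch (on made-up values, not the real “vuu” column) of how such a dichotomization could be coded:

```r
# Illustrative sketch only (not applied in this analysis): if exploration had
# shown a sharp difference between vuu == 0 and vuu > 0, the attribute could
# be dichotomized like this (synthetic values for demonstration):
vuu.example = c(0, 0, 1, 2.5, 3, 0, 4)
vuu.cat = factor(ifelse(vuu.example == 0, "zero", "positive"),
                 levels = c("zero", "positive"))
table(vuu.cat)
```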
Perform principal components analysis of this data (do you need to scale it prior to that? how would you represent multilevel categorical attributes to be used as inputs for PCA?) and plot observations in the space of the first few principal components indicating levels of some of the categorical attributes of your choosing by the color/shape of the symbol. Perform univariate assessment of associations between the outcome we will be modeling and each of the attributes (e.g. t-test or logistic regression for continuous attributes, contingency tables/Fisher exact test/\(\chi^2\) test for categorical attributes). Summarize your observations from these assessments: does it appear that there are predictors associated with the outcome noyes univariately? Which predictors seem to be more/less relevant?
Since the numerical ranges of the attributes differ, with some going from 0 to 1 but others going up to 18, it is best to scale the data prior to PCA. The “id” column is excluded because it is a unique identifier for each row and its value does not appear related to the outcome.
There are two main ways to handle multilevel categorical attributes for PCA: represent each level as an individual dummy variable, or run PCA on the continuous attributes only and indicate the levels by shape/color on the resulting plots to check visually for clustering by category. I combined the two approaches: categorical attributes with only 2 levels are represented as dummy variables (“og”, “gb”, and “xxp”), while some of the remaining categorical attributes are visualized along the principal component axes to see whether there is any clustering by category.
The attributes with the largest loadings for the first two principal components were “ptj” and “vuu”. Plots of the observations in the space of the first few principal components, with some of the categorical attributes indicated by color, are shown below. There is no very clear separation between categories, but, for example, for low values of PC1 there tend to be more no’s for “noyes”, more fpj’s for “cle”, and more kc’s for “og”.
For univariate assessments, logistic regression was used for continuous attributes, while Fisher’s exact tests and \(\chi^2\) tests were used for categorical attributes (to save space, all but one example of each are commented out). From the logistic regressions, all continuous attributes except “zq” (“vuu”, “mt”, “zf”, “ihj”, and “ptj”) were significant, with extremely low p-values below 2e-16. For categorical attributes with only 2 levels, for which we used Fisher’s exact test, “og” was not significant, while “gb” and “xxp” were (p-values below 2.2e-16). For the remaining categorical attributes, for which we used \(\chi^2\) tests, “at” was less significant (p-value around 0.01), while the others were highly significant (p-values below 2.2e-16).
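For categorical attributes with more than two levels, hand-coding dummies as done below does not scale well; a sketch of the standard alternative using model.matrix() on a toy factor (the values here are made up):

```r
# Sketch: expand a multilevel factor into one indicator column per level,
# suitable as numeric input for PCA; ~ 0 + x drops the intercept so that
# every level gets its own column.
x = factor(c("fv", "kd", "zl", "kd", "fv"))
X = model.matrix(~ 0 + x)
dim(X)        # 5 observations by 3 indicator columns
rowSums(X)    # each row has exactly one 1
```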
# Structure data for PCA (change 2 category variables into dummy variables)
trainData.PCA = trainData
trainData.PCA$og.kc = trainData$og == "kc"
trainData.PCA$gb.1 = trainData$gb == 1
trainData.PCA$xxp.ptn = trainData$xxp == "ptn"
trainData.PCA = trainData.PCA[, c(1,3:5,7,10:21)]
# Principal components analysis of continuous attributes (scaled and without outcome)
PC = prcomp(trainData.PCA[, c(5,8,10:13,15:17)], scale.=TRUE)
# Scree plot of PCA results
# plot(PC)
# Attributes with largest loadings for the first and second principal components
# PCA1 = sort(abs(PC$rotation[,1]))[9]; PCA1 # ptj
# PCA2 = sort(abs(PC$rotation[,2]))[9]; PCA2 # vuu
# Plot of the first few principal components
# biplot(PC, cex=0.5)
oldpar <- par(mfrow=c(4,3), xpd = TRUE)
for (i in c(1,3,9,15)) {
trainData.PCA[,i]=factor(trainData.PCA[,i])
plot(PC$x[rowsTrain,1], PC$x[rowsTrain,2], xlab="PC1", ylab="PC2", col=trainData.PCA[rowsTrain,i], pch=16, cex=0.5)
plot(PC$x[rowsTrain,2], PC$x[rowsTrain,3], xlab="PC2", ylab="PC3", col=trainData.PCA[rowsTrain,i], pch=16, cex=0.5, main=names(trainData.PCA)[i])
plot(PC$x[rowsTrain,1], PC$x[rowsTrain,3], xlab="PC1", ylab="PC3", col=trainData.PCA[rowsTrain,i], oma=c(3,3,5,10), pch=16, cex=0.5)
legend("topright", pch=16, col = 1:length(levels(trainData.PCA[rowsTrain, i])), legend = c(levels(trainData.PCA[rowsTrain, i])))
}
par(oldpar)
# Logistic fits of continuous variables
fit.vuu = glm(noyes~vuu, data=trainData, family=binomial)
summary(fit.vuu)
##
## Call:
## glm(formula = noyes ~ vuu, family = binomial, data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8936 -0.8275 -0.6914 -0.2872 2.5334
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.16787 0.07865 -40.28 <2e-16 ***
## vuu 0.65666 0.02426 27.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35188 on 31554 degrees of freedom
## Residual deviance: 34245 on 31553 degrees of freedom
## AIC: 34249
##
## Number of Fisher Scoring iterations: 4
# fit.mt = glm(noyes~mt, data=trainData, family=binomial)
# summary(fit.mt)
# fit.zq = glm(noyes~zq, data=trainData, family=binomial)
# summary(fit.zq) # Not significant
# fit.zf = glm(noyes~zf, data=trainData, family=binomial)
# summary(fit.zf)
# fit.ihj = glm(noyes~ihj, data=trainData, family=binomial)
# summary(fit.ihj)
# fit.ptj = glm(noyes~ptj, data=trainData, family=binomial)
# summary(fit.ptj)
# Fisher's exact tests of categorical variables with 2 values
fisher.test(table(trainData[,1], trainData$og)) # Not significant
##
## Fisher's Exact Test for Count Data
##
## data: table(trainData[, 1], trainData$og)
## p-value = 0.2148
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9765304 1.1119573
## sample estimates:
## odds ratio
## 1.041881
# fisher.test(table(trainData[,1], trainData$gb))
# fisher.test(table(trainData[,1], trainData$xxp))
# Chi squared tests for remaining categorical variables
chisq.test(table(trainData[,1], trainData$bdb))
##
## Pearson's Chi-squared test
##
## data: table(trainData[, 1], trainData$bdb)
## X-squared = 406.33, df = 7, p-value < 2.2e-16
# chisq.test(table(trainData[,1], trainData$cle))
# chisq.test(table(trainData[,1], trainData$kbc))
# chisq.test(table(trainData[,1], trainData$rao))
# chisq.test(table(trainData[,1], trainData$at)) # Less significant
# chisq.test(table(trainData[,1], trainData$lf))
# chisq.test(table(trainData[,1], trainData$fq))
# Helper function from lecture to assess predictions
assess.prediction = function(truth, predicted, verbose=T) {
# Remove cases where either truth or prediction is missing
keep = !is.na(truth) & !is.na(predicted)
predicted = predicted[keep]
truth = truth[keep]
# True/false positives/negatives
TP = sum(truth=="yes" & predicted=="yes")
TN = sum(truth=="no" & predicted=="no")
FP = sum(truth=="no" & predicted=="yes")
FN = sum(truth=="yes" & predicted=="no")
P = TP+FN # total postives in truth data
N = TN+FP # total negatives in truth data
# All results
results = list(correct=sum(truth==predicted),
               accuracy=signif(sum(truth==predicted)*100/length(truth),3),
               error=signif(100-sum(truth==predicted)*100/length(truth),3),
               TPR=signif(100*TP/P, 3), TNR=signif(100*TN/N, 3),
               PPV=signif(100*TP/(TP+FP),3), FDR=signif(100*FP/(TP+FP),3))
# Don't print information if verbose is False
if ( verbose == F ) {
return(results)
}
# Total cases
cat("Total cases that are not NA: ", length(truth), "\n", sep ="")
# Overall accuracy
cat("Correct predictions (accuracy): ", results$correct, " (", results$accuracy, "%)\n", sep ="")
cat("Test error=1-accuracy: ", results$error, "%\n", sep ="")
# Sensitivity/specificity/etc.
cat("TPR (sensitivity)=TP/P: ", results$TPR, "%\n", sep ="")
cat("TNR (specificity)=TN/N: ", results$TNR, "%\n", sep ="")
cat("PPV (precision)=TP/(TP+FP): ", results$PPV, "%\n", sep ="")
cat("FDR (false discovery)=1-PPV: ", results$FDR, "%\n", sep ="")
}
Develop logistic regression model of the outcome noyes as a function of multiple predictors in the model. Which variables are significantly associated with the outcome? Test model performance on multiple splits of data into training and test subsets and summarize it in terms of accuracy/error/sensitivity/specificity.
Initially, logistic regression was fit on just the continuous variables (besides “id”), and surprisingly, even though “vuu” was univariately significant, it was the least significant in the multiple regression. Backward stepwise elimination then removed “vuu” followed by “zq” before obtaining a set of significant continuous variables, which gave 79.3% training accuracy. We then added the categorical variables and repeated the process, this time removing “zq” followed by “og”, with “vuu” remaining significant. Having more variables resulted in a training accuracy of 86.7%. To test the stability of this model, we performed 5-fold cross-validation and obtained similar results, with medians of around 86.7% accuracy, 13.3% error, 60% sensitivity, and 95% specificity, as seen in the boxplots below.
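The manual backward elimination described here could also be automated with step(), which drops terms by AIC; a minimal sketch on simulated data (x1 and x2 are made-up variables, not columns of this dataset):

```r
set.seed(1)
# Simulate a binary outcome driven by x1 only; x2 is pure noise
n = 500
x1 = rnorm(n); x2 = rnorm(n)
y = rbinom(n, 1, plogis(-1 + 2 * x1))
full = glm(y ~ x1 + x2, family = binomial)
# Backward elimination by AIC; typically drops the uninformative x2
reduced = step(full, direction = "backward", trace = 0)
formula(reduced)
```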
# True values
true = trainData$noyes
# Fit logistic regression model on continuous variables
fit.log.cont = glm(noyes~vuu+mt+zq+zf+ihj+ptj, data=trainData, family=binomial)
# summary(fit.log.cont)
predicted1 = ifelse(predict(fit.log.cont, newdata=trainData[ , 3:18], type="response") > 0.5, "yes", "no")
# assess.prediction(true, predicted1) # Assess prediction => 79.3% accuracy
# Logistic regression model with only significant continuous variables (remove vuu then zq using backwards stepwise)
fit.log.cont2 = glm(noyes~mt+zf+ihj+ptj, data=trainData, family=binomial)
# summary(fit.log.cont2)
predicted2 = ifelse(predict(fit.log.cont2, newdata=trainData[ , 3:18], type="response") > 0.5, "yes", "no")
assess.prediction(true, predicted2) # Assess prediction => 79.3% accuracy
## Total cases that are not NA: 31555
## Correct predictions (accuracy): 25022 (79.3%)
## Test error=1-accuracy: 20.7%
## TPR (sensitivity)=TP/P: 27.3%
## TNR (specificity)=TN/N: 96.2%
## PPV (precision)=TP/(TP+FP): 70.2%
## FDR (false discovery)=1-PPV: 29.8%
# Fit logistic regression model on all variables
fit.log.all = glm(noyes~.-id, data=trainData, family=binomial)
# summary(fit.log.all)
predicted3 = ifelse(predict(fit.log.all, newdata=trainData[ , 2:18], type="response") > 0.5, "yes", "no")
# assess.prediction(true, predicted3) # Assess prediction => 86.7% accuracy
# Logistic regression model with only significant variables (remove zq then og using backwards stepwise)
fit.log.final = glm(noyes~.-id-zq-og, data=trainData, family=binomial)
# summary(fit.log.final)
# Compare true and predicted values
predicted.log.final = ifelse(predict(fit.log.final, newdata=trainData[ , 2:18], type="response") > 0.5, "yes", "no")
table(true, predicted.log.final)
## predicted.log.final
## true no yes
## no 22659 1143
## yes 3051 4702
# Accuracy, error rate, sensitivity, and specificity of final model
assess.prediction(true, predicted.log.final) # Assess prediction => 86.7% accuracy, 60.6% TPR, and 95.2% TNR
## Total cases that are not NA: 31555
## Correct predictions (accuracy): 27361 (86.7%)
## Test error=1-accuracy: 13.3%
## TPR (sensitivity)=TP/P: 60.6%
## TNR (specificity)=TN/N: 95.2%
## PPV (precision)=TP/(TP+FP): 80.4%
## FDR (false discovery)=1-PPV: 19.6%
# Test model on train/test splits
# k-fold crossvalidation
N.xval = 5
N.iter = 3
data=trainData
results = NULL
for ( iter in 1:N.iter ) {
grps = sample( (1:nrow(data)) %% N.xval+1 ) # Randomly assign groups
for ( i in 1:N.xval ) {
data.test = data[grps == i,] # Set group i as test
data.train = data[grps != i,] # Set the rest as train
# Expected results from train/test predictions
train.true = data.train$noyes
test.true = data.test$noyes
# Results for fit
fit = glm(noyes~.-id-zq-og, data=data.train, family=binomial)
train.predicted = ifelse(predict(fit, newdata=data.train[ , 2:18], type="response") > 0.5, "yes", "no")
test.predicted = ifelse(predict(fit, newdata=data.test[ , 2:18], type="response") > 0.5, "yes", "no")
# Calculate and store accuracy/error/sensitivity/specificity
test.results = assess.prediction(test.true, test.predicted, verbose=F)
results = rbind(results, data.frame(accuracy = test.results$accuracy, error = test.results$error,
sensitivity = test.results$TPR, specificity = test.results$TNR, model="Logistic"))
}
}
# Plot results from cross-validation
plot1 = ggplot(results, aes(y=accuracy)) + geom_boxplot() + ylab("Accuracy (%)") +
ggtitle("Accuracy of Logistic Regression") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot2 = ggplot(results, aes(y=error)) + geom_boxplot() + ylab("Error (%)") +
ggtitle("Error of Logistic Regression") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot3 = ggplot(results, aes(y=sensitivity)) + geom_boxplot() + ylab("Sensitivity (%)") +
ggtitle("Sensitivity of Logistic Regression") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot4 = ggplot(results, aes(y=specificity)) + geom_boxplot() + ylab("Specificity (%)") +
ggtitle("Specificity of Logistic Regression") + theme(plot.title = element_text(size=12, hjust = 0.5))
grid.arrange(plot1, plot2, plot3, plot4, nrow=2)
Assess the impact/significance of pairwise interaction terms for all pairwise combinations of covariates used in the model and report the top ten that most significantly improve model fit.
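One way to carry this out is to add each pairwise interaction to the base model in turn and rank the interactions by likelihood-ratio test p-value; a sketch on simulated data (x1/x2/x3 are made-up variables, but the same screening loop applies to the real covariates):

```r
set.seed(2)
# Simulate data with a true x1:x2 interaction
n = 400
d = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y = rbinom(n, 1, plogis(d$x1 + d$x1 * d$x2))
base = glm(y ~ x1 + x2 + x3, data = d, family = binomial)
# Add each pairwise interaction in turn; score by likelihood-ratio test
prs = combn(c("x1", "x2", "x3"), 2)
pvals = apply(prs, 2, function(p) {
  f = update(base, as.formula(paste0(". ~ . + ", p[1], ":", p[2])))
  anova(base, f, test = "Chisq")[2, "Pr(>Chi)"]
})
names(pvals) = apply(prs, 2, paste, collapse = ":")
# Report the most significant interactions (top ten on the real data)
head(sort(pvals), 10)
```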
Fit linear discriminant analysis model of the outcome noyes as a function of the rest of covariates in the dataset. Feel free to decide whether you want to use all of them or a subset of those. Test resulting model performance on multiple splits of the data into training and test subsets, summarize it in terms of accuracy/error/sensitivity/specificity and compare them to those obtained for logistic regression.
I decided to fit the LDA classifier on the same subset of variables found for logistic regression (all variables except “id”, “zq”, and “og”); including or excluding “zq” and “og” made no difference in training accuracy. Cross-validation was performed on multiple splits into training and test subsets, and the accuracy/error/sensitivity/specificity are shown in the boxplots below. Overall, the error rate for LDA was similar to but slightly worse than that of the logistic regression above: it tended to have slightly higher specificity at the cost of lower sensitivity. From the boxplots, LDA achieves around 86.2% accuracy, 13.8% error, 58% sensitivity, and 95% specificity.
# Fit LDA classifier on all variables
fit.LDA.all = lda(noyes~.-id, data=trainData)
predicted4 = ifelse(predict(fit.LDA.all, newdata=trainData[ , 2:18])$posterior[,2] > 0.5, "yes", "no")
# assess.prediction(true, predicted4) # Assess prediction => 86.4% accuracy
# LDA classifier with only significant variables as determined from logistic regression (remove zq and og)
fit.LDA.final = lda(noyes~.-id-zq-og, data=trainData)
predicted.LDA.final = ifelse(predict(fit.LDA.final, newdata=trainData[ , 2:18])$posterior[,2] > 0.5, "yes", "no")
table(true, predicted.LDA.final)
## predicted.LDA.final
## true no yes
## no 22690 1112
## yes 3177 4576
# Accuracy, error rate, sensitivity, and specificity of final model
assess.prediction(true, predicted.LDA.final) # Assess prediction => 86.4% accuracy, 59% TPR, and 95.3% TNR
## Total cases that are not NA: 31555
## Correct predictions (accuracy): 27266 (86.4%)
## Test error=1-accuracy: 13.6%
## TPR (sensitivity)=TP/P: 59%
## TNR (specificity)=TN/N: 95.3%
## PPV (precision)=TP/(TP+FP): 80.5%
## FDR (false discovery)=1-PPV: 19.5%
# Test model on train/test splits
# k-fold crossvalidation
for ( iter in 1:N.iter ) {
grps = sample( (1:nrow(data)) %% N.xval+1 ) # Randomly assign groups
for ( i in 1:N.xval ) {
data.test = data[grps == i,] # Set group i as test
data.train = data[grps != i,] # Set the rest as train
# Expected results from train/test predictions
train.true = data.train$noyes
test.true = data.test$noyes
# Results for fit
fit = lda(noyes~.-id-zq-og, data=data.train)
train.predicted = ifelse(predict(fit, newdata=data.train[ , 2:18])$posterior[,2] > 0.5, "yes", "no")
test.predicted = ifelse(predict(fit, newdata=data.test[ , 2:18])$posterior[,2] > 0.5, "yes", "no")
# Calculate and store accuracy/error/sensitivity/specificity
test.results = assess.prediction(test.true, test.predicted, verbose=F)
results = rbind(results, data.frame(accuracy = test.results$accuracy, error = test.results$error,
sensitivity = test.results$TPR, specificity = test.results$TNR, model="LDA"))
}
}
# Plot results from cross-validation
plot1 = ggplot(results, aes(y=accuracy, col=model)) + geom_boxplot() + ylab("Accuracy (%)") +
ggtitle("Accuracy of LDA") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot2 = ggplot(results, aes(y=error, col=model)) + geom_boxplot() + ylab("Error (%)") +
ggtitle("Error of LDA") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot3 = ggplot(results, aes(y=sensitivity, col=model)) + geom_boxplot() + ylab("Sensitivity (%)") +
ggtitle("Sensitivity of LDA") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot4 = ggplot(results, aes(y=specificity, col=model)) + geom_boxplot() + ylab("Specificity (%)") +
ggtitle("Specificity of LDA") + theme(plot.title = element_text(size=12, hjust = 0.5))
grid.arrange(plot1, plot2, plot3, plot4, nrow=2)
In our experience, attempting to fit a quadratic discriminant analysis model of the categorical outcome noyes on this data results in a rank deficiency related error. Determine how to correct this error and report the resulting model training and test error/accuracy/etc. and how it compares to LDA and logistic regression above.
The error can be corrected by removing variables that are (nearly) linear combinations of others. I used GVIF^(1/(2*Df)) from the logistic fit as a guideline and added variables back one by one, starting with all the continuous variables and then the remaining categorical ones, stopping just before the rank-deficiency error occurred. At that point the removed variables were “kbc” and “og”, which resulted in a training accuracy of 82.7%. Unlike previously with the banknote data, QDA performed worse than LDA. Cross-validation was performed on multiple splits into training and test subsets, and the accuracy/error/sensitivity/specificity are shown in the boxplots below, with accuracy dropping to around 83%, error rising to 17%, sensitivity down to 47%, and specificity down to around 94%.
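The underlying problem can also be checked directly by comparing the rank of the model matrix to its column count; a sketch on a toy data frame where one factor duplicates another (names are made up):

```r
# Sketch: factor b carries exactly the same information as factor a, so the
# model matrix is rank deficient; qr() exposes the dependency.
d = data.frame(a = factor(c("p", "q", "p", "q")),
               b = factor(c("u", "v", "u", "v")))
X = model.matrix(~ a + b, data = d)
ncol(X)       # 3 columns: intercept, aq, bv
qr(X)$rank    # 2, so one column is a linear combination of the others
```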
# Examine VIFs from logistic fit
vif(fit.log.all)
## GVIF Df GVIF^(1/(2*Df))
## bdb 1.346752 7 1.021492
## cle 1.115740 4 1.013784
## kbc 3.517348 12 1.053802
## og 1.005065 1 1.002529
## vuu 2.114722 1 1.454208
## gb 1.132995 1 1.064422
## xxp 1.052256 1 1.025795
## rao 1.127089 7 1.008582
## at 1.014673 3 1.002431
## mt 1.152804 1 1.073687
## lf 2.900867 5 1.112379
## zq 1.003013 1 1.001506
## zf 1.758990 1 1.326269
## ihj 2.968857 1 1.723037
## ptj 3.831484 1 1.957418
## fq 1.377118 3 1.054780
# QDA classifier with og and kbc removed: kbc has a fairly high GVIF and the highest Df
fit.QDA.final = qda(noyes~.-id-og-kbc, data=trainData)
predicted5 = ifelse(predict(fit.QDA.final, newdata=trainData[ , -1])$posterior[,2] > 0.5, "yes", "no")
# assess.prediction(true, predicted5) # Assess prediction => 82.7% accuracy, 47.5% sensitivity, 94.2% specificity
# Test model on train/test splits
# k-fold crossvalidation
for ( iter in 1:N.iter ) {
grps = sample( (1:nrow(data)) %% N.xval+1 ) # Randomly assign groups
for ( i in 1:N.xval ) {
data.test = data[grps == i,] # Set group i as test
data.train = data[grps != i,] # Set the rest as train
# Expected results from train/test predictions
train.true = data.train$noyes
test.true = data.test$noyes
# Results for fit
fit = qda(noyes~.-id-og-kbc, data=data.train)
train.predicted = ifelse(predict(fit, newdata=data.train[ , 2:18])$posterior[,2] > 0.5, "yes", "no")
test.predicted = ifelse(predict(fit, newdata=data.test[ , 2:18])$posterior[,2] > 0.5, "yes", "no")
# Calculate and store accuracy/error/sensitivity/specificity
test.results = assess.prediction(test.true, test.predicted, verbose=F)
results = rbind(results, data.frame(accuracy = test.results$accuracy, error = test.results$error,
sensitivity = test.results$TPR, specificity = test.results$TNR, model="QDA"))
}
}
# Plot results from cross-validation
plot1 = ggplot(results, aes(y=accuracy, col=model)) + geom_boxplot() + ylab("Accuracy (%)") +
ggtitle("Accuracy of QDA") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot2 = ggplot(results, aes(y=error, col=model)) + geom_boxplot() + ylab("Error (%)") +
ggtitle("Error of QDA") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot3 = ggplot(results, aes(y=sensitivity, col=model)) + geom_boxplot() + ylab("Sensitivity (%)") +
ggtitle("Sensitivity of QDA") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot4 = ggplot(results, aes(y=specificity, col=model)) + geom_boxplot() + ylab("Specificity (%)") +
ggtitle("Specificity of QDA") + theme(plot.title = element_text(size=12, hjust = 0.5))
grid.arrange(plot1, plot2, plot3, plot4, nrow=2)
Develop random forest model of outcome noyes. Present variable importance plots and comment on relative importance of different attributes in the model. Did attributes showing up as more important in random forest model also appear as significantly associated with the outcome by logistic regression? Test model performance on multiple splits of data into training and test subsets, compare test and out-of-bag error estimates, summarize model performance in terms of accuracy/error/sensitivity/specificity and compare to the performance of logistic regression and LDA models above.
From the variable importance plot after fitting random forest to all the variables (besides “id”), we find that “og” had the lowest MeanDecreaseGini, so we removed that variable from the fit. It is interesting that even though “zq” was earlier dropped from logistic regression because it was not significant, in the RF model it is quite important, ranked fourth. Otherwise, the attributes showing up as more important overlap with those significant in logistic regression. Model performance was tested on multiple splits of the data using 5-fold cross-validation (on a random 1/10 subset of the data to save computation time). Based on the boxplots below, random forest has comparable accuracy/error to the logistic regression and LDA models above (maybe around 0.2% better based on the median), but improves sensitivity by around 5% while specificity drops by around 1%. The larger spread of errors is likely from subsetting the data instead of using the whole dataset, but we see it has the potential to reach accuracies as high as 90%, which would be better than the other models. The out-of-bag error estimates for the fit on all variables vs. the fit with “id” and “og” removed were 11.53% vs. 11.56%. This is in line with the error rates from cross-validation, which were around 13%.
# RF classifier with all variables (except id)
x.all = trainData[, -c(1,2)] # drop id and outcome columns
y = trainData$noyes
rf.all <- randomForest(x.all, y)
# predicted.rf.all = predict(rf.all, newdata=x.all)
# assess.prediction(true, predicted.rf.all) # Assess prediction => 100% accuracy
# Variable importance plot of RF model
varImpPlot(rf.all, type=2)
# Out-of-bag error estimate
# rf.all # => OOB error of 11.53%
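The OOB estimates quoted here were read off the printed model objects, but they can also be extracted programmatically from the `err.rate` slot of a fitted `randomForest` object (one row per tree, with a cumulative "OOB" column). A small self-contained sketch on a toy fit, illustrating the API only; for the document's model, substitute `rf.all`:

```r
library(randomForest)
# Toy fit for illustration; err.rate has one row per tree and an "OOB" column
# holding the cumulative out-of-bag error, so the last row is the final estimate.
set.seed(121)
toy.rf <- randomForest(iris[, -5], iris$Species, ntree=100)
oob.final <- toy.rf$err.rate[toy.rf$ntree, "OOB"]
# For rf.all, the same expression recovers the ~11.53% quoted above.
# Plotting the column shows whether the error has converged in the tree count:
# plot(toy.rf$err.rate[, "OOB"], type="l", xlab="Number of trees", ylab="OOB error")
```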
# RF classifier with reduced variables as determined from importance plot (remove id and og)
x.reduced = trainData[, -c(1,2,6)]
rf.reduced <- randomForest(x.reduced, y)
predicted.rf.reduced = predict(rf.reduced, newdata=x.reduced)
table(true, predicted.rf.reduced)
## predicted.rf.reduced
## true no yes
## no 23802 0
## yes 105 7648
# Variable importance plot of updated RF model
varImpPlot(rf.reduced, type=2)
# Out-of-bag error estimate
rf.reduced # => OOB error of 11.56%
##
## Call:
## randomForest(x = x.reduced, y = y)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 11.56%
## Confusion matrix:
## no yes class.error
## no 22201 1601 0.06726326
## yes 2046 5707 0.26389785
# Accuracy, error rate, sensitivity, and specificity of model
assess.prediction(true, predicted.rf.reduced) # Assess prediction on training data => 99.7% accuracy
## Total cases that are not NA: 31555
## Correct predictions (accuracy): 31450 (99.7%)
## Test error=1-accuracy: 0.333%
## TPR (sensitivity)=TP/P: 98.6%
## TNR (specificity)=TN/N: 100%
## PPV (precision)=TP/(TP+FP): 100%
## FDR (false discovery)=1-PPV: 0%
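The `assess.prediction()` helper is defined earlier in this document; for reference, a hypothetical minimal reconstruction consistent with how it is called here (treating "yes" as the positive class and returning metrics as percentages) might look like:

```r
# Hypothetical reconstruction of the assess.prediction() helper used throughout
# this document (the real definition appears earlier); "yes" is the positive class.
assess.prediction.sketch <- function(truth, predicted, verbose=TRUE) {
  ok = !is.na(truth) & !is.na(predicted)
  truth = truth[ok]; predicted = predicted[ok]
  TP = sum(truth == "yes" & predicted == "yes")
  TN = sum(truth == "no"  & predicted == "no")
  FP = sum(truth == "no"  & predicted == "yes")
  FN = sum(truth == "yes" & predicted == "no")
  res = list(accuracy = 100 * (TP + TN) / length(truth),
             error    = 100 * (FP + FN) / length(truth),
             TPR      = 100 * TP / (TP + FN),  # sensitivity
             TNR      = 100 * TN / (TN + FP))  # specificity
  if (verbose) cat("Total cases that are not NA:", length(truth), "\n")
  res
}
```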
# Test model on train/test splits
# k-fold crossvalidation
N.xval = 5
N.iter = 5
# Note: `results` is the data frame of CV metrics carried over from the
# logistic regression and LDA sections above; RF results are appended to it
for ( iter in 1:N.iter ) {
  subset = sample( (1:nrow(trainData)) %% 10 ) # Randomly assign subset with 1/10 of data for faster runtime
  data.subset = trainData[subset == 5,]
  grps = sample( (1:nrow(data.subset)) %% N.xval+1 ) # Randomly assign groups
  for ( i in 1:N.xval ) {
    data.test = data.subset[grps == i,] # Set group i as test
    data.train = data.subset[grps != i,] # Set the rest as train
    # Expected results from train/test predictions
    train.true = data.train$noyes
    test.true = data.test$noyes
    # Fit RF on the training fold (id, noyes, og removed) and predict
    fit = randomForest(data.train[, -c(1,2,6)], train.true)
    train.predicted = predict(fit, newdata=data.train[, -c(1,2,6)])
    test.predicted = predict(fit, newdata=data.test[, -c(1,2,6)])
    # Calculate and store accuracy/error/sensitivity/specificity
    test.results = assess.prediction(test.true, test.predicted, verbose=F)
    results = rbind(results, data.frame(accuracy = test.results$accuracy, error = test.results$error,
                                        sensitivity = test.results$TPR, specificity = test.results$TNR, model="RF"))
  }
}
# Plot results from cross-validation
plot1 = ggplot(results, aes(y=accuracy, col=model)) + geom_boxplot() + ylab("Accuracy (%)") +
ggtitle("Accuracy of Random Forest") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot2 = ggplot(results, aes(y=error, col=model)) + geom_boxplot() + ylab("Error (%)") +
ggtitle("Error of Random Forest") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot3 = ggplot(results, aes(y=sensitivity, col=model)) + geom_boxplot() + ylab("Sensitivity (%)") +
ggtitle("Sensitivity of Random Forest") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot4 = ggplot(results, aes(y=specificity, col=model)) + geom_boxplot() + ylab("Specificity (%)") +
ggtitle("Specificity of Random Forest") + theme(plot.title = element_text(size=12, hjust = 0.5))
grid.arrange(plot1, plot2, plot3, plot4, nrow=2)
Develop SVM model of categorical outcome noyes deciding on the choice of kernel, cost, etc. that appear to yield better performance. Test model performance on multiple splits of data into training and test subsets, summarize model performance in terms of accuracy/error/sensitivity/specificity and compare to the performance of the rest of the models developed above (logistic regression, LDA, random forest).
A radial kernel was used for SVM classification. An initial test with cost and gamma values of 1 resulted in a training accuracy of around 98.2%. However, when we tuned the SVM fit on a random subset of the data (1/20 of the original dataset), the tune function found that higher costs (around 10) and lower gammas (around 0.02) gave lower errors. 5-fold cross-validation was then performed with these tuned parameters (on a subset of 1/10 of the data to save computation time). Based on the boxplots below, SVM with a radial kernel performs similarly to the random forest model, with some trends shared with the logistic/LDA models as well. There was quite a bit of variation between runs, which is probably due to using only a subset of the data to get in the general ballpark of the actual results while saving computation time. SVM has essentially identical accuracy/error to random forest and is also comparable to the LDA and logistic models. There was a boost in sensitivity (falling between LDA/logistic and random forest), at least for the median value, but there is a lot of variation. Given the uncertainty of the broad boxplot distributions, there does not seem to be a clear winner.
# Following commented out for faster runtime
# Initial test: fit SVM model without id and og, as in random forest, using cost and gamma of 1
# svmRes <- svm(noyes~.-id-og, data=trainData, kernel="radial", scale=T, cost=1, gamma=1)
# predicted6 = predict(svmRes, newdata=trainData[, -1])
# assess.prediction(true, predicted6) # Assess prediction => 98.2% accuracy
# Tuning cost and gamma
# Split data for faster run-time
# subset = sample( (1:nrow(trainData)) %% 20 ) # Randomly assign subset with 1/20 of data for faster runtime
# data.subset = trainData[subset == 5,]
# fit.svm.final = tune(svm, noyes~.-id-og, data=data.subset, kernel="radial", scale=T,
# ranges=list(cost=c(1,5,10), gamma=c(0.02,0.05,0.1)))
# fit.svm.final # Optimized for cost=10 and gamma=0.02
# predicted.svm.final = predict(fit.svm.final$best.model, newdata=trainData[ , -1])
# Accuracy, error rate, sensitivity, and specificity of model
# assess.prediction(true, predicted.svm.final) # Assess prediction => 86.4% accuracy
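As a side note (kept commented out like the tuning run above, since it depends on the fit.svm.final object), e1071's tune() result exposes the full grid-search summary, so the chosen parameters can be inspected without re-running the search:

```r
# fit.svm.final$best.parameters   # data.frame of the winning combination (cost=10, gamma=0.02)
# fit.svm.final$best.performance  # cross-validation error of that combination
# plot(fit.svm.final)             # error surface over the (cost, gamma) grid
```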
# Test model on train/test splits
# k-fold crossvalidation
N.xval = 5
N.iter = 5
for ( iter in 1:N.iter ) {
  subset = sample( (1:nrow(trainData)) %% 10 ) # Randomly assign subset with 1/10 of data for faster runtime
  data.subset = trainData[subset == 5,]
  grps = sample( (1:nrow(data.subset)) %% N.xval+1 ) # Randomly assign groups
  for ( i in 1:N.xval ) {
    data.test = data.subset[grps == i,] # Set group i as test
    data.train = data.subset[grps != i,] # Set the rest as train
    # Expected results from train/test predictions
    train.true = data.train$noyes
    test.true = data.test$noyes
    # Fit SVM on the training fold with the tuned parameters and predict
    fit = svm(noyes~.-id-og, data=data.train, kernel="radial", scale=T, cost=10, gamma=0.02)
    train.predicted = predict(fit, newdata=data.train[, -1])
    test.predicted = predict(fit, newdata=data.test[, -1])
    # Calculate and store accuracy/error/sensitivity/specificity
    test.results = assess.prediction(test.true, test.predicted, verbose=F)
    results = rbind(results, data.frame(accuracy = test.results$accuracy, error = test.results$error,
                                        sensitivity = test.results$TPR, specificity = test.results$TNR, model="SVM"))
  }
}
# Plot results from cross-validation
plot1 = ggplot(results, aes(y=accuracy, col=model)) + geom_boxplot() + ylab("Accuracy (%)") +
ggtitle("Accuracy of SVM") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot2 = ggplot(results, aes(y=error, col=model)) + geom_boxplot() + ylab("Error (%)") +
ggtitle("Error of SVM") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot3 = ggplot(results, aes(y=sensitivity, col=model)) + geom_boxplot() + ylab("Sensitivity (%)") +
ggtitle("Sensitivity of SVM") + theme(plot.title = element_text(size=12, hjust = 0.5))
plot4 = ggplot(results, aes(y=specificity, col=model)) + geom_boxplot() + ylab("Specificity (%)") +
ggtitle("Specificity of SVM") + theme(plot.title = element_text(size=12, hjust = 0.5))
grid.arrange(plot1, plot2, plot3, plot4, nrow=2)
Compare performance of the models developed above (logistic regression, LDA, random forest, SVM) in terms of their accuracy, error and sensitivity/specificity. Comment on differences and similarities between them.
Overall, these models all fell within the same ballpark, at most around 2 to 5% away from each other. Logistic regression had the most consistently high accuracy, while its sensitivity and specificity were more middle of the pack. LDA showed a slight decrease in accuracy/increase in error (less than around 0.5%), lost ground on sensitivity, but gained it in specificity; which of the two is preferable depends on whether the goal is better sensitivity or better specificity. As mentioned before, the random forest and SVM results show more spread because they are more computationally intensive, so a smaller subset of the data was used for testing. Even though their median accuracies are similar to the LDA/logistic fits (sometimes higher, sometimes lower depending on the run), they have a greater opportunity to achieve accuracies as high as 90%. Compared to the LDA and logistic fits, random forest and SVM tend to have higher sensitivities, with some loss of specificity as a consequence.
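A compact numeric companion to the boxplots is the median of each CV metric per model. The call below is shown on a toy stand-in so the snippet runs on its own; applied to the `results` data frame accumulated over the four model sections, the same `aggregate()` call tabulates the medians being compared here:

```r
# Median of each cross-validation metric per model. `results.demo` is a toy
# stand-in for the document's `results` data frame (hypothetical values).
results.demo = data.frame(accuracy = c(88.1, 88.3, 87.9, 88.0),
                          error = c(11.9, 11.7, 12.1, 12.0),
                          sensitivity = c(71, 76, 70, 75),
                          specificity = c(95, 94, 95, 94),
                          model = c("LDA", "RF", "logistic", "SVM"))
aggregate(cbind(accuracy, error, sensitivity, specificity) ~ model,
          data = results.demo, FUN = median)
```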
Decide on the model that performs the best and use it to make predictions for the test dataset. This is the dataset that is provided separately from training data without the outcome noyes that we are modeling here. Upload resulting predictions in comma-separated values (CSV) format into the Canvas website. Please check sample files with test dataset predictions for the expected format of the *.csv file with predictions: your submission must be in precisely the same format – two and only two columns, first column - ids of the test observations (“id” column in test dataset), second - predictions as yes/no calls (not 0/1, 1/2, true/false, etc.). The name of the second column of predictions is what will be used in leaderboard as its name.
I decided that the random forest model with "id" and "og" removed was the best, due to its heightened sensitivity, the metric on which the other models were weakest and which needs the most improvement. Using the fit from Problem 4 above, I obtained predictions for the test data and saved them to a CSV file.
# Decided to proceed with random forest as the final model (specifically the reduced model with id and og removed)
test.predictions = predict(rf.reduced, newdata=testData[,-c(1,5)]) # drop id and og (og is column 5 here, since testData has no noyes column)
testPredictions = data.frame(id=testData$id, randomGuess=test.predictions)
write.csv(testPredictions, file = "predictions-aw.csv", row.names=F)
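A quick pre-upload check along these lines can catch format slips (two columns, yes/no calls only). It is shown on a toy frame so the snippet runs on its own; the same assertions apply to the testPredictions object built above:

```r
# Pre-upload format check: exactly two columns, and only "yes"/"no" calls in
# the second. `preds.demo` is a toy stand-in for testPredictions.
preds.demo = data.frame(id = 1:3, randomGuess = c("no", "yes", "no"))
stopifnot(ncol(preds.demo) == 2,
          all(preds.demo[[2]] %in% c("no", "yes")))
table(preds.demo[[2]]) # rough class balance of the predicted calls
```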
This is not really a problem per se but rather a criterion that we will go by when assessing quality of your predictions for the test dataset. You get these four points if your predictions for test dataset are better than those obtained from a fair coin flip (already shown in leaderboard and as examples of the file format for predictions upload) by at least 10% on all four metrics shown in the leaderboard (accuracy, sensitivity, specificity and precision). But then predictions by the coin flip should not be very difficult to improve upon.